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Abstract — On the heels of compressed sensing, a re- 
markable new field has very recently emerged. This field 
addresses a broad range of problems of significant practical 
interest, namely, the recovery of a data matrix from what 
appears to be incomplete, and perhaps even corrupted, 
information. In its simplest form, the problem is to recover 
a matrix from a small sample of its entries, and comes 
up in many areas of science and engineering including 
collaborative filtering, machine learning, control, remote 
sensing, and computer vision to name a few. 

This paper surveys the novel literature on matrix com- 
pletion, which shows that under some suitable conditions, 
one can recover an unknown low-rank matrix from a 
nearly minimal set of entries by solving a simple convex 
optimization problem, namely, nuclear-norm minimization 
subject to data constraints. Further, this paper introduces 
novel results showing that matrix completion is provably 
accurate even when the few observed entries are corrupted 
with a small amount of noise. A typical result is that one 
can recover an unknown n x n matrix of low rank r from 
just about nr log 2 n noisy samples with an error which 
is proportional to the noise level. We present numerical 
results which complement our quantitative analysis and 
show that, in practice, nuclear norm minimization accu- 
rately fills in the many missing entries of large low-rank 
matrices from just a few noisy samples. Some analogies 
between matrix completion and compressed sensing are 
discussed throughout. 

Keywords. Matrix completion, low-rank matrices, 
semidefinite programming, duality in optimization, 
nuclear-norm minimization, oracle inequalities, com- 
pressed sensing. 

I. Introduction 

Imagine that we only observe a few samples of a 
signal. Is it possible to reconstruct this signal exactly 
or at least accurately? For example, suppose we observe 
a few entries of a vector x € R™, which we can think 
of as a digital signal or image. Can we recover the large 
fraction of entries — of pixels if you will — that we have 
not seen? In general, everybody would agree that this is 
impossible. However, if the signal is known to be sparse 
in the Fourier domain and, by extension, in an incoherent 
domain, then accurate — and even exact — recovery is 
possible by l\ minimization [9], see also [20] for other 
algorithms, [15], [16] for other types of measurements, 
and [31] for different ideas. This revelation is at the root 



of the rapidly developing field of compressed sensing, 
and is already changing the way engineers think about 
data acquisition, hence this special issue and others, see 
[2] for example. Concretely, if a signal has a sparse 
frequency spectrum and we only have information about 
a few time or space samples, then one can invoke linear 
programming to interpolate the signal exactly. One can 
of course exchange time (or space) and frequency, and 
recover sparse signals from just a few of their Fourier 
coefficients as well. 

Imagine now that we only observe a few entries of a 
data matrix. Then is it possible to accurately — or even 
exactly — guess the entries that we have not seen? For 
example, suppose we observe a few movie ratings from 
a large data matrix in which rows are users and columns 
are movies (we can only observe a few ratings because 
each user is typically rating a few movies as opposed to 
the tens of thousands of movies which are available). Can 
we predict the rating a user would hypothetically assign 
to a movie he/she has not seen? In general, everybody 
would agree that recovering a data matrix from a subset 
of its entries is impossible. However, if the unknown 
matrix is known to have low rank or approximately low 
rank, then accurate and even exact recovery is possible 
by nuclear norm minimization [8], [12]. This revelation, 
which to some extent is inspired by the great body of 
work in compressed sensing, is the subject of this paper. 

From now on, we will refer to the problem of infer- 
ring the many missing entries as the matrix completion 
problem. By extension, inferring a matrix from just a few 
linear functionals will be called the the low-rank matrix 
recovery problem. Now just as sparse signal recovery 
is arguably of paramount importance these days, we do 
believe that matrix completion and, in general, low-rank 
matrix recovery is just as important, and will become 
increasingly studied in years to come. For now, we give 
a few examples of applications in which these problems 
do come up. 

• Collaborative filtering. In a few words, collabora- 
tive filtering is the task of making automatic predic- 
tions about the interests of a user by collecting taste 
information from many users [21]. Perhaps the most 
well-known implementation of collaborating filter- 



ing is the Netflix recommendation system alluded 
to earlier, which seeks to make rating predictions 
about unseen movies. This is a matrix completion 
problem in which the unknown full matrix has 
approximately low rank because only a few fac- 
tors typically contribute to an individual's tastes 
or preferences. In the new economy, companies 
are interested predicting musical preferences (Apple 
Inc.), literary preferences (Amazon, Barnes and 
Noble) and many other such things. 

• System identification. In control, one would like to 
fit a discrete-time linear time-invariant state-space 
model 

x(t + l) = Ax(t) + Bu(t), 
y(t) = Cx(t) + Du(t) 

to a sequence of inputs u(t) E W n and outputs 
y{t) S R p , t — 0,...,N. The vector x(t) G R n 
is the state of the system at time t, and n is the 
order of the system model. From the input/output 
pair {(u(t),y(t)) : t = 0,...N}, one would like 
to recover the dimension of the state vector n (the 
model order), and the dynamics of the system, 
i.e. the matrices A, B, C, D, and the initial state 
x(0). This problem can be cast as a low -rank matrix 
recovery problem, see [23] and references therein. 

• Global positioning. Finding the global positioning 
of points in Euclidean space from a local or partial 
set of pairwise distances is a problem in geometry 
that emerges naturally in sensor networks [6], [28], 
[29]. For example, because of power constraints, 
sensors may only be able to construct reliable 
distance estimates from their immediate neighbors. 
From these estimates, we can form a partially 
observed distance matrix, and the problem is to 
infer all the pairwise distances from just a few 
observed ones so that locations of the sensors can 
be reliably estimated. This reduces to a matrix 
completion problem where the unknown matrix is 
of rank two if the sensors are located in the plane, 
and three if they are located are in space. 

• Remote sensing. The MUSIC algorithm [27] is 
frequently used to determine the direction of arrival 
of incident signals in a coherent radio-frequency 
environment. In a typical application, incoming 
signals are being recorded at various sensor lo- 
cations, and this algorithm operates by extracting 
the directions of wave arrivals from the covariance 
matrix obtained by computing the correlations of 
the signals received at all sensor pairs. In remote 
sensing applications, one may not be able to esti- 



mate or transmit all correlations because of power 
constraints [32]. In this case, we would like to infer 
a full covariance matrix from just a few observed 
partial correlations. This is a matrix completion 
problem in which the unknown signal covariance 
matrix has low rank since it is equal to the number 
of incident waves, which is usually much smaller 
than the number of sensors. 
There are of course many other examples including the 
structure-from-motion problem [13], [30] in computer 
vision, multi-class learning in data analysis [3], [4], and 
so on. 

This paper investigates whether or not one can recover 
low rank matrices from fewer entries, and if so, how 
and how well. In Section [II] we will study the noiseless 
problem in which the observed entries are precisely those 
of the unknown matrix. Section [III] examines the more 
common situation in which the few available entries are 
corrupted with noise. We complement our study with a 
few numerical experiments demonstrating the empirical 
performance of our methods in Section |IV] and conclude 
with a short discussion (Section \V\ . 

Before we begin, it is best to provide a brief summary 
of the notations used throughout the paper. We shall use 
three norms of a matrix X S jj™i x ™2 w ith singular 
values {(Jfc}. The spectral norm is denoted by ||X|| 
and is the largest singular value. The Euclidean inner 
product between two matrices is defined by the formula 
(X,Y) := tra,ce(X*Y), and the corresponding Eu- 
clidean norm is called the Frobenius norm and denoted 
by ||X||iT (note that this is the li norm of the vector 
of singular values). The nuclear norm is denoted by 
11-^11* := J2k a k an d i s me sum °f singular values (the 
t\ norm of the vector {cfe}). As is standard, X y Y 
means that X — Y is positive semidefinite. 

Further, we will also manipulate linear transformation 
which acts on the space ]R" lX " 2 , and we will use 
calligraphic letters for these operators as in A(X). In 
particular, the identity operator on this space will be 
denoted by 1 : R n ^ xn ^ -> R n ^ xn \ We use the same 
convention as above, and A y T means that A — T 
(seen as a big matrix) is positive semidefinite. 

We use the usual asymptotic notation, for instance 
writing O(M) to denote a quantity bounded in mag- 
nitude by CM for some absolute constant C > 0. 

II. Exact Matrix Completion 
From now on, M £ K™i x "2 j s a ma tn X we would 
like to know as precisely as possible. However, the only 
information available about M is a sampled set of entries 
Mij, € ri, where f2 is a subset of the complete 



2 



set of entries [m] x [n 2 \. (Here and in the sequel, [n] 
denotes the list {1, ...,n}.) It will be convenient to 
summarize the information available via Vq(M), where 
the sampling operator V n : K™i x ™2 -> r iX " 2 is 
defined by 



[Pn(X)]i 



Xij, (i, fie SI, 



0. 



otherwise. 



Thus, the question is whether it is possible to recover 
our matrix only from the information Vq(M). We will 
assume that the entries are selected at random without 
replacement as to avoid trivial situations in which a 
row or a column is unsampled, since matrix completion 
is clearly impossible in such cases. (If we have no 
data about a specific user, how can we guess his/her 
preferences? If we have no distance estimates about a 
specific sensor, how can we guess its distances to all the 
sensors?) 

Even with the information that the unknown matrix 
M has low rank, this problem may be severely ill posed. 
Here is an example that shows why: let ibe a vector in 
W L and consider the n x n rank-1 matrix 



M = e\x* = 



Xl x 2 x 3 









%n— 1 %n 

o o 













where e\ is the first vector in the canonical basis of R n . 
Clearly, this matrix cannot be recovered from a subset of 
its entries. Even if one sees 95% of the entries sampled 
at random, then we will miss elements in the first row 
with very high probability, which makes the recovery 
of the vector x, and by extension of M, impossible. 
The analogy in compressed sensing is that one obviously 
cannot recover a signal assumed to be sparse in the time 
domain, by subsampling in the time domain! 

This example shows that one cannot hope to complete 
the matrix if some of the singular vectors of the matrix 
are extremely sparse — above, one cannot recover M 
without sampling all the entries in the first row, see [8] 
for other related pathological examples. More generally, 
if a row (or column) has no relationship to the other 
rows (or columns) in the sense that it is approximately 
orthogonal, then one would basically need to see all 
the entries in that row to recover the matrix M. Such 
informal considerations led the authors of [8] to intro- 
duce a geometric incoherence assumption, but for the 
moment, we will discuss an even simpler notion which 
forces the singular vectors of M to be spread across all 



coordinates. To express this condition, recall the singular 
value decomposition (SVD) of a matrix of rank r, 



M - 



^2 a kU k v* k , 

k£[r] 



(II. 1) 



in which a\ , . . . , ay > are the singular values, and 
Ui,...,u r G K™ 1 , Vi,...,v r G M™ 2 are the singular 
vectors. Our assumption is as follows: 

||wfe|Uoo ^ VvB/ni, \\vkWicc < y/JiB/n, (II.2) 

for some fi B > 1, where the i x norm is of course 
defined by H^H^ = maxj \xi\. We think of [i B as being 
small, e.g. 0(1), so that the singular vectors are not too 
spiky as explained above. 

If the singular vectors of M are sufficiently spread, 
the hope is that there is a unique low-rank matrix which 
is consistent with the observed entries. If this is the case, 
one could, in principle, recover the unknown matrix by 
solving 



minimize 
subject to 



rank(X) 

Vn(X) = Vn(M), 



(113) 



where X E R™i x ™2 j s tne decision variable. Unfortu- 
nately, not only is this algorithm NP-hard, but all known 
algorithms for exactly solving it are doubly exponential 
in theory and in practice [14]. This is analogous to 
the intractability of € -minimization in sparse signal 
recovery. 

A popular alternative is the convex relaxation [8], [12], 
[17], [19], [26] 



minimize 
subject to 



11*11. 

Va(X) = V a (M), 



(II.4) 



(see [5], [25] for the earlier related trace heuristic). Just 
as t\ -minimization is the tightest convex relaxation of 
the combinatorial £ - mm i m i za tion problem in the sense 
that the t\ ball of R" is the convex hull of unit-normed 
1 -sparse vectors (i.e. vectors with at most one nonzero 
entry), nuclear-norm minimization is the tightest convex 
relaxation of the NP-hard rank minimization problem. 
To be sure, the nuclear ball {X € M"i x ™ 2 : \\X\\ t < 1} 
is the convex hull of the set of rank-one matrices with 
spectral norm bounded by one. Moreover, in compressed 
sensing, £\ minimization subject to linear equality con- 
straints can be cast as a linear program (LP) for the 
£i norm has an LP characterization: indeed for each 
x e R n , \\x\\ ei is the optimal value of 



maximize 
subject to 



(u,x) 

IMI*oo ^ !. 
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with decision variable u £ R". In the same vein, the nu- 
clear norm of X £ R™i xn 2 nas tne grjp characterization 



maximize 
subject to 



(W,X) 
\\W\\ < 1, 



cn.5) 



with decision variable W £ M™ lX ™ 2 . This expresses the 
fact that the spectral norm is dual to the nuclear norm. 
The constraint on the spectral norm of W is an SDP 
constraint since it is equivalent to 



W* I n2 



ho, 



where /„ is the n X n identity matrix. Hence, flll.4) is 
an SDP, which one can express by writing ||X||* as the 
optimal value of the SDP dual to ( |II.5| ). 

In [12], it is proven that nuclear-norm minimization 
succeeds nearly as soon as recovery is possible by any 
method whatsoever. 

Theorem 1: [12] Let M £ W 11 * 112 be a fixed 
matrix of rank r = 0(1) obeying ( |H.2| i and set n := 
max(m, n 2 ). Suppose we observe m entries of M with 
locations sampled uniformly at random. Then there is a 
positive numerical constant C such that if 



rn > C n B n log 



m.6) 



then M is the unique solution to ( |II.4| i with probability 
at least 1 — n~ 3 . In other words: with high probability, 
nuclear-norm minimization recovers all the entries of M 
with no error. 

As a side remark, one can obtain a probability of 
success at least 1 — n^ 13 for j3 by taking C in ( |II.6| l 
of the form C/3 for some universal constant C. 

An n% x n 2 matrix of rank r depends upon r{n\ + 
n>2 — r) degrees of freedom^] When r is small, the 
number of degrees of freedom is much less than nxn 2 
and this is the reason why subsampling is possible. (In 
compressed sensing, the number of degrees of freedom 
corresponds to the sparsity of the signal; i.e. the number 
of nonzero entries.) What is remarkable here, is that 
exact recovery by nuclear norm minimization occurs as 
soon as the sample size exceeds the number of degrees 
of freedom by a couple of logarithmic factors. Further, 
observe that if £1 completely misses one of the rows 
(e.g. one has no rating about one user) or one of the 
columns (e.g. one has no rating about one movie), then 
one cannot hope to recover even a matrix of rank 1 of 
the form M = xy* . Thus one needs to sample every 
row (and also every column) of the matrix. When £1 is 
sampled at random, it is well established that one needs 

1 This can be seen by counting the degrees of freedom in the singular 
value decomposition. 



at least on the order O(nlogn) for this to happen as 
this is the famous coupon collector's problem. Hence, 
( |II.6| l misses the information theoretic limit by at most 
a logarithmic factor. 

To obtain similar results for all values of the rank, 
[12] introduces the strong incoherence property with 
parameter /i stated below. 

Al Let Pjj (resp. P v ) be the orthogonal pro- 
jection onto the singular vectors Ux,...,u r 
(resp. Vi, . . ., v r ). For all pairs (a, a') £ [m] x 
[ni] and (b, b') £ [n 2 ] x [n 2 ], 



(e b ,P v e b > 



m 

r 

n 2 



^-a—a' 
■lb=6' 



< 



/'" 



ni 
n 2 ' 



A2 Let E be the "sign matrix" defined by 

E = u kv* k - 
fc6[r] 

For all (a,b) £ [m] x [n 2 ], 

\Eab\ < M" 



(II.7) 



s/nin 2 

These conditions do not assume anything about the 
singular values. As we will see, incoherent matrices with 
a small value of the strong incoherence parameter fi can 
be recovered from a minimal set of entries. Before we 
state this result, it is important to note that many model 
matrices obey the strong incoherence property with a 
small value of \i. 

• Suppose the singular vectors obey ( |H.2| > with /is = 
O(l) (which informally says that the singular vec- 
tors are not spiky), then with the exception of a 
very few peculiar matrices, M obeys the strong 
incoherence property with /1 = O(y/\ogn). 

• Assume that the column matrices [ui,..., u r ] and 
[vx,..., v r ] are independent random orthogonal ma- 
trices, then with high probability, Al obeys the 
strong incoherence property with [i = 0{\f log n), 
at least when r > logn as to avoid small samples 
effects. 

The sampling result below is general, nonasymptotic 
and optimal up to a few logarithmic factors. 

Theorem 2: [12] With the same notations as in The- 
orem [T] there is a numerical constant C such that if 



m > C n nr log 



(II.8) 



M is the unique solution to ( |H.4| i with probability at 
least 1 — n~ 3 . 
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In other words, if a matrix is strongly incoherent and 
the cardinality of the sampled set is about the number of 
degrees of freedom times a few logarithmic factors, then 
nuclear-norm minimization is exact. This improves on 
an earlier result of Candes and Recht [8] who proved — 
under slightly different assumptions — that on the order 
of n 6 / 5 r log n samples were sufficient, at least for values 
of the rank obeying r < n 1 / 5 . 

We would like to point out a result of a broadly 
similar nature, but with a completely different recovery 
algorithm and with a somewhat different range of appli- 
cability, which was recently established by Keshavan, 
Oh, and Montanari [22]. Their conditions are related 
to the incoherence property introduced in [8], and are 
also satisfied by a number of reasonable random matrix 
models. There is, however, another condition which 
states that the singular values of the unknown matrix 
cannot be too large or too small (the ratio between the 
top and lowest value must be bounded). This algorithm 
1) trims each row and column with too few entries; 
i.e replaces the entries in those rows and columns by 
zero and 2) computes the SVD of the trimmed matrix 
and truncate it as to only keep the top r singular values 
(note that the value of r is needed here). The result is 
that under some suitable conditions discussed above, this 
recovers a good approximation to the matrix M provided 
that the number of samples be on the order of nr. The 
recovery is not exact but only approximate although the 
authors have announced that one could add extra steps 
to the algorithm to provide an exact recovery if one has 
more samples (on the order of nrlogn). At the time of 
this writing, such results are not yet publicly available. 



A. Geometry and dual certificates 

We cannot possibly rehash the proof of Theorem [2] 
from [12] in this paper, or even explain the main techni- 
cal steps, because of space limitations. We will, however, 
detail sufficient and almost necessary conditions for the 
low-rank matrix M to be the unique solution to the SDP 
(jUA) . This will be useful to establish stability results. 

The recovery is exact if the feasible set is tangent to 
the nuclear ball at the point M, see Figure [T] which rep- 
resents the set of points (a;, y, z) E R 3 such that the 2x2 
x y 



symmetric matrix " has nuclear norm bounded by 
[y z\ 

one. To express this mathematically^] standard duality 
theory asserts that M is a solution to (|II.4[) if and only 



2 In general, M minimizes the nuclear norm subject to the linear 
constraints A{X) = b, A : M ni x " 2 -> K m , if and only if there is 
A 6 M m such that A*(\) £ d||M||„. 




Fig. 1: The blue shape (courtesy of B. Recht) 
represents the nuclear ball (see the main text), 
and the plane the feasible set (courtesy . 



if there exists a dual matrix A such that Pq(A) is a 
subgradient of the nuclear norm at M, written as 



Pn(A) € d\\M\ 



(H.9) 



Recall the SVD ( |II.1| ) of M and the "sign matrix" E 
( |H.7| i. It is is well-known that Z S 0||M||» if and only 
if Z is of the form, 



Z = E+W, (11.10) 



where 



P v W = 0, WP v = 0, ||W||<1. (11.11) 

In English, Z is a subgradient if it can be decomposed as 
the sign matrix plus another matrix with spectral norm 
bounded by one, whose column (resp. row) space is or- 
thogonal to the span of Ui, . . . , u r , (resp. of V\, . . . , v r ). 
Another way to put this is by using notations introduced 
in [8]. Let T be the linear space spanned by elements 
of the form utx* and yvt, k € [r], and let T 1 - be the 
orthogonal complement to T. Note that T 1 - is the set 
of matrices obeying PuW = and WPv = 0. Then, 
Z e 0||M||, if and only if 

z = e + v t ±(z) 1 \\r T x{z)\\ < i. 

This motivates the following definition. 

Definition 3 (Dual certificate): We say that A is a 
dual certificate if A is supported on £1 (A = Vq(A)), 
Vt(A) = E and ||7V(A)|| < 1. 

Before continuing, we would like to pause to observe 
the relationship with l\ minimization. The point x* £ 
K n is solution to 



minimize 
subject to 



Ax = b, 



(11.12) 



5 



with A G M. mxn if and only if there exists A G K m such 
that A*X G Note that if S* is the support of 

x*, z £ d\\x*\\e 1 is equivalent to 

Jsgn«), i G S\ 
z = e + w, e — < 

10, i£S\ 



and 



Wi = for all i G 5, lltulU^ < 1. 



Hence, there is a clear analogy and one can think of T 
defined above as playing the role of the support set in 
the sparse recovery problem. 

With this in place, we shall make use of the following 
lemma from [8]: 

Lemma 4: [8] Suppose there exists a dual certificate 
A and consider any H obeying Vn{H) = 0. Then 

\\M + H\U > \\M\U - (1 - ||7V(A)||)||7V(#)II*. 

Proof: For any Z G 9||M||*, we have 

|| M + H\\» > ||M||* + (Z,H). 

With A = E + V T x (A) and Z = E + V T ± (Z), we have 

\\M + H\\* > ||M||* + (A,H) + (V T ±(Z-A),H) 



\M\\ 



{Z-A,V T x(H)) 



since Vn(H) = 0. Now we use the fact that 
the nuclear and spectral norms are dual to one an- 
other. In particular, there exists \\Z\\ < 1 such that 
(Z,V T x{H)) = \\V T x(H)\\, and \{A,V T x(H))\ = 
\(V T x(A),V T x(H))\ < \\V T ±(A)\\\\P T ±(H)\\,. There- 
fore, 

||M + H\\ m > \\M\U + (1 - ||P T x (A) 11)11^(1011.. 

which concludes the proof. ■ 
A consequence of this lemma are the sufficient con- 
ditions below. 

Lemma 5: [8] Suppose there exists a dual certifi- 
cate obeying IjT^i (F) || < 1 and that the restriction 
Vn [ T - T -> Vn0& nxn ) of the (sampling) operator Vn 
restricted to T is injective. Then M is the unique solution 
to the convex program ( |H.4| >. 

Proof: Consider any feasible perturbation M + H 
obeying Vn(H) = 0. Then by assumption, Lemma [4] 
gives 

||M + JT||, > ||M||, 

unless V T x(H) = 0. Assume then that V T x{H) = 0; 
that is to say, H G T. Then Vq(H) = implies that 
H = by the injectivity assumption. The conclusion 
is that M is the unique minimizer since any nontrivial 
perturbation increases the nuclear norm. ■ 



The methods for proving that matrix completion by 
nuclear minimization is exact, consist in constructing a 
dual certificate. 

Theorem 6: [12] Under the assumptions of either 
Theorem [T] or Theorem [2] there exists a dual certifi- 
cate obeying HPju^A)!! < 1/2. In addition, if p = 
m/(nin 2 ) is the fraction of observed entries, the op- 
erator VtVqVt ■ T — > T is one-to-one and obeys 



| J <V T VnP T < y X > 



(11.13) 



where X : T — * T is the identity operator. 

The second part, namely, ( |H.13| > shows that the map- 
ping "Pa : T — > M"i x ™ 2 is injective. Hence, the sufficient 
conditions of Lemma [5] are verified, and the recovery 
is exact. What is interesting, is that the existence of a 
dual certificate together with the near-isometry ( |II.13| l — 
in fact, the lower bound — are sufficient to establish the 
robustness of matrix completion vis a vis noise. 

III. Stable Matrix Completion 

In any real world application, one will only observe a 
few entries corrupted at least by a small amount of noise. 
In the Netflix problem, users' ratings are uncertain. In 
the system identification problem, one cannot determine 
the locations y(t) with infinite precision. In the global 
positioning problem, local distances are imperfect. And 
finally, in the remote sensing problem, the signal covari- 
ance matrix is always modeled as being corrupted by 
the covariance of noise signals. Hence, to be broadly 
applicable, we need to develop results which guarantee 
that reasonably accurate matrix completion is possible 
from noisy sampled entries. This section develops novel 
results showing that this is, indeed, the case. 

Our noisy model assumes that we observe 



Y 



Mi. 



(id) e n, 



(iii.i) 



where {Zij : G fi} is a noise term which may be 
stochastic or deterministic (adversarial). Another way to 
express this model is as 

Vn(Y) = Tn(M) + Vn(Z), 

where Z is an n x n matrix with entries Z^ for S 
(note that the values of Z outside of are irrelevant). 
All we assume is that ||7 7 n(^)||F < 5 f° r some S > 0. 
For example, if {Z^} is a white noise sequence with 
standard deviation a, then S 2 < (m + %/8to)ct 2 with 
high probability, say. To recover the unknown matrix, 
we propose solving the following optimization problem: 



minimize 
subject to 



\\Vn(X-Y)\\ F <5 



(111.2) 
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Among all matrices consistent with the data, find the one 
with minimum nuclear norm. This is also an SDP, and 
let M be the solution to this problem. 

Our main result is that this reconstruction is accurate. 

Theorem 7: With the notations of Theorem|6| suppose 
there exists a dual certificate obeying ||"P T x(A)|| < 1/2 
and that VtV^Vt h \t (both these conditions are true 
with very large probability under the assumptions of the 
noiseless recovery Theorems and |}. Then M obeys 

\\M — M\\p < 4 J — ^ - 21 5 + 26, (1113) 

with C p — 2 + p. 

For small values of p (recall this is the fraction 
of observed e ntries), the error is of course at most 

just about 4 ^ 2min(n 1 ,« 2 ) ^ Afj 

we will see from the 
proof, there is nothing special about 1/2 in the condition 
WPt^WW < I/ 2 - All we need is that there is a dual 
certificate obeying ||P T j.(A)|| < a for some a < 1 
(the value of a only influences the numerical constant 
in ( |HI.3| l). Further, when Z is random, (JhOJ holds on 
the event \\Vn(Z)\\ F < 6. 

Roughly speaking, our theorem states the following: 
when perfect noiseless recovery occurs, then matrix 
completion is stable vis a vis perturbations. To be sure, 
the error is proportional to the noise level 6; when 
the noise level is small, the error is small. Moreover, 
improving conditions under which noiseless recovery 
occurs, has automatic consequences for the more realistic 
recovery from noisy samples. 

A significant novelty here is that there is just no 
equivalent of this result in the compressed sensing or 
statistical literature for our matrix completion problem 
does not obey the restricted isometry property (RIP) [10]. 
For matrices, the RIP would assume that the sampling 
operator obeys 

(1 - 6)\\X\\l < i||^(A)|| 2 F < (1 + 6)\\X\\ F (IH.4) 

for all matrices X with sufficiently small rank and 8 < 1 
sufficiently small [26]. However, the RIP does not hold 
here. To see why, let the sampled set Q, be arbitrarily 
chosen and fix 4- ^- Th en tne rank-1 matrix e^* 
whose (i,j)th entry is 1, and vanishes everywhere else, 
obeys Tn^e*) = 0. Clearly, this violates ( |III4) . 

It is nevertheless instructive to compare ( |III.3[ ) with the 
bound one would achieve if the RIP ( |III.4[ ) were true. In 
this case, [18] would give 

\\M-M\\ F < C Q p- 1/2 6 



for some numerical constant Co. That is, an esti- 
mate which would be better by a factor proportional 
to 1 / -\/niin(7ii, 712). It would be interesting to know 
whether or not estimates, which are as good as what 
is achievable under the RIP, hold for the RIPless matrix 
completion problem. We will return to such comparisons 
later (Section |III-B| ). 

We close this section by emphasizing that our methods 
are also applicable to sparse signal recovery problems in 
which the RIP does not hold. 

A. Proof of Theorem [7] 

We use the notation of the previous section, and begin 
the proof by observing two elementary properties. The 
first is that since M is feasible for ( |IH.2) , we have the 
cone constraint 

\\M\U < \\m\U. (IH.5) 

The second is that the triangle inequality implies the tube 
constraint 

\\Vn(M ~ M)\\ F < \\V n (M -Y)\\ F + \\V a (Y - M)\\ F 
< 26, (111.6) 

since M is feasible. We will see that under our hypothe- 
ses, ( |III.5| ) and < |III.6| imply that M is close to M. Set 
M = M + H and put H Q := Vq(H), Hq* := Vqo(H) 
for short. We ne ed to bound ||iT||^ = \\H n \\ 2 F +\\H n 4 F , 
and since ( |III.6[ ) gives ||-ffn||F < 2 6, it suffices to bound 
||-ffnc|| F . Note that by the Pythagorean identity, we have 

\\Hn4% - \\V T (H n c)\\ F + \\V t ±( h &)\\f, CHI.7) 

and it is thus sufficient to bound each term in the right 
hand-side. 

We start with the second term. Let A be a dual 
certificate obeying HT^^A)!! < 1/2, we have 

\\M + H\U>\\M + H Q c\\*-\\H Q \U 

and 

||M + flh«||, > ||M|U + [l-||7' T x(A)||]||P T x(fr n 0ll*. 

The second inequality follows from Lemma|4] Therefore, 
with HT't^A)!! < 1/2, the cone constraint gives 

||M||, > ||M||, + i||P T x(H n .)||.-||irn||., 
or, equivalently, 

\\V T x(H^)\U<2\\H n \U. 
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Since the nuclear norm dominates the Frobenius norm, 

\\V T ±(Hno)\\ F < we have 

\\V T ±(Ha.)\\ F <2\\H a \U 

< 2y/n\\H n \\ F < 4y/nS, (III.8) 

where the second inequality follows from the Cauchy- 
Schwarz inequality, and the last from ( |III.6| l. 

To develop a bound on ||7'r(-H'n)||Fi observe that the 
assumption VtPqPt h %1 together with V\ = Vt> 

\\VnP T {H n c)f F = (VnV T (H n o),VnV T (H n .)) 
P, 



>^\\r T (H n .)f F . 



But since Vq(Hqc) 
VnV T ^{H n a), we have 







\\VaPT(H n c)\\ F = \\VnP T ^H n c)\\ F 
< \\V T ±(H^)\\ F . 

Hence, the last two inequalities give 

\\V T (H^)\\ 2 F < -WPnV T (H^)\\ 2 F < - \\V T ±(H n «)\\ F . 



(111.9) 



P P 

As a consequence of this and ( |1II.7| |, we have 

'2 
'P 

The theorem then follows from this inequality together 
with < fTTT~8> . 



\Hn4%< (~ + l)ll?V(#n=)|||. 



B. Comparison with an oracle 

We would like to return to discussing the best possible 
accuracy one could ever hope for. For simplicity, assume 
that ni = n% = n, and suppose that we have an oracle 
informing us about T. In many ways, going back to the 



discussion from Section II-A this is analogous to giving 



away the support of the signal in compressed sensing 
[11]. With this precious information, we would know 
that M lives in a linear space of dimension 2nr — r 2 
and would probably solve the problem by the method of 
least squares: 



minimize \\Vn(X) - Vn(Y)\\ F 
subject to X e T. 



(III. 10) 



That is, we would find the matrix in T, which best 
fits the data in a least-squares sense. Let A : T — > ft 
(we abuse notations and let be the range of Vq) 
defined by A := VqVt- Then assuming that the operator 
A* A = VtPuVt mapping T onto T is invertible 



(which is the case under the hypotheses of Theorem 17), 
the least-squares solution is given by 



M' 



Oracle 



Hence, 



I AT 



Oracle 



= (A*A)- 1 A*(Y) 

M + (A*Ay 1 A*(Z). (lll.ll) 
-M\\ F = \\{A*A)- l A*{Z)\\ F . 



Let Z' be the minimal (normalized) eigenvector of 
A* A with minimum eigenvalue A m ; n , and set Z = 
SX m i n /2 A(Z') (note that by definition V n {Z) = Z since 
Z is in the range of ^4). By construction, \\Z\\ F = S, 
and 

\\{A*A)- l A*{Z)\\ F = X^6>p- x /H 

since by assumption, all the eigenvalues of A* A — 

VtVqVt lie in the interval [p/2, 3f>/2]. The matrix 

Z defined above also maximizes 

among all matrices bounded by S and so the oracle 

achieves 

'p- 1/2 S (111.12) 



, M Oracle _ M |, 



with adversarial noise. Consequently, our analysis looses 
a y/n factor vis a vis an optimal bound that is achievable 
via the help of an oracle. 

The diligent reader may argue that the least-squares 
solution above may not be of rank r (it is at most 
of rank 2r) and may thus argue that this is not the 
strongest possible oracle. However, as explained below, 
if the oracle gave T and r, then the best fit in T of rank 
r would not do much better than ( |III.12[ ). In fact, there 
is an elegant way to understand the significance of this 
oracle which we now present. Consider a stronger oracle 
which reveals the row space of the unknown matrix M 
(and thus the rank of the matrix). Then we would know 
that the unknown matrix is of the form 

M = M C R*, 

where Mc is an n x r matrix, and R is an n x r matrix 
whose columns form an orthobasis for the row space 
(which we can build since the oracle gave us perfect 
information). We would then fit the nr unknown entries 
by the method of least squares and find X E W lXT 
minimizing 

\\Vn(XR*)-Vn(Y)\\ F . 

Using our previous notations, the oracle gives away T C 
T where Tq is the span of elements of the form yv^, 
k G [r], and is more precise. If Aq : To — * is defined 
by Aa := VqVto, then the least-squares solution is now 

(A* A )- l Al(Y). 
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n 


100 200 500 1000 


RMS error 


.99 .61 .34 .24 



TABLE I: RMS error (\\M - M\\ F /ri) as a 
function of n when subsampling 20% of an 
n x n matrix of rank two. Each RMS error is 
averaged over 20 experiments. 



Because all the eigenvalues of AqAq belong to 
[A m in(*4*-4), X max (A*A)], the previous analysis applies 
and this stronger oracle would also achieve an error of 
size about p~ x l 2 b. In conclusion, when all we know 
is ||7 , nC^)||.F < 5, one cannot hope for a root-mean 
squared error better than p~ 1 / 2 8. 

Note that when the noise is stochastic, e. g. when 
Zij is white noise with standard deviation cr, the oracle 
gives an error bound which is adaptive, and is smaller 
as the rank gets smaller. Indeed, E \\(A* A)^ A* {Z)\\ 2 F 
is equal to 

„2 



cr 2 trace((„4*.4)~ 1 ) 



2nr 



r 



2nr 



-a 



-cr, (111.13) 



p p 

2 eigenvalues of (A* A)~ l are just 



since all the 2nr — r 
about equal to p . When nr <C m, this is better than 
( pLT2l ). 

IV. Numerical Experiments 

We have seen that matrix completion is stable amid 
noise. To emphasize the practical nature of this result, a 
series of numerical matrix completion experiments were 
run with noisy data. To be precise, for several values 
of the dimension n (our first experiments concern nx n 
matrices), the rank r, and the fraction of observed entries 
p = m/n 2 , the following numerical simulations were 
repeated 20 times, and the errors averaged. A rank-r 
matrix M is created as the product of two rectangular 
matrices, M = M L M* R , where the entries of Ml , M R e 
E" xr are iid N(0,al : = 20/Vnf] The sampled set n 
is picked uniformly at random among all sets with m 
entries. The observations Vn{Y) are corrupted by noise 
as in ( |ni.l) , where {Z^} is iid N(0, a 2 ); here, we take 
<T=1. Lastly, M is recovered as the solution to ( |IV. 1 \ 
below. 

For a peek at the results, consider Table [I] The RMS 
error defined as \\M — M\\p/n, measures the root-mean 
squared error per entry. From the table, one can see 

3 The value of a n is rather arbitrary. Here, it is set so that the singular 
values of M are quite larger than the singular values of V^i{Z) so 
that M can be distinguished from the null matrix. Having said that, 
note that for large n and small r, the entries of M are much smaller 
than those of the noise, and thus the signal appears to be completely 
buried in noise. 



that even though each entry is corrupted by noise with 
variance 1, when M is a 1000 by 1000 matrix, the RMS 
error per entry is .24. To see the significance of this, 
suppose one had the chance to see all the entries of the 
noisy matrix Y = M + Z. Naively accepting Y as an 
estimate of M would lead to an expected MS error of 
K\\Y - M\\ 2 F /n 2 = K\\Z\\ 2 F /n 2 = 1, whereas the RMS 
error achieved from only viewing 20% of the entries is 
\\M - M\\ 2 F /n 2 = .24 2 = .0576 when solving the SDP 
( |IV.l| l! Not only are we guessing accurately the entries 
we have not seen, but we also "denoise" those we have 
seen. 

In order to stably recover M from a fraction of noisy 
entries, the following regularized nuclear norm mini- 
mization problem was solved using the FPC algorithm 
from [24], 



minimize h\V n (X - Y)\\ 2 F + fi\\X\ 



(IV. 1) 



It is a standard duality result that (JTyTJi is equivalent to 
( |III.2| i, for some value of [i, and thus one could use ( |I V. 1 [ ) 
to solve ( |III.2| i by searching for the value of giving 
\\Vn(M-Y)\\ F = S (assuming \\Vn(Y)\\ F > 6). We use 
( |I V. 1 \ because it works well in practice, and because the 
FPC algorithm solves ( |I V. 1 \ nicely and accurately. We 
also remark that a variation on our stability proof could 
also give a stable error bound when using the SDP ftV.l) . 

It is vital to choose a suitable value of /i, which we do 
with the following heuristic argument: first, simplifying 
to the case when £1 is the set of all elements of the 
matrix, note that the solution of ( |IV. 1 1 is equal to Y 
but with singular values shifted towards zero by fi 
(soft-thresholding), as can be seen from the optimality 
conditions of Section [II] by means of subgradients, or 
see [7]. When £1 is not the entire set, the solution is 
no longer exactly a soft-thresholded version of Y, but 
experimentally, it is generally close. Thus, we want to 
pick fj, large enough to threshold away the noise (keep 
the variance low), and small enough not to overshrink the 
original matrix (keep the bias low). To this end, /j, is set 
to be the smallest possible value such that if M = and 
Y = Z, then it is likely that the minimizer of ( |I V. 1 [ ) 
satisfi es M = 0. It can be seen that the solution to 
( IV. 1 [ > is M = if ||Pn(Y)|| < /x ( once a g ain , check 
the subgradient or [7]). Then the question is: what is 
llT^si^)!!? If we make a nonessential change in the way 
il is sampled, then the answer follows from random 
matrix theory. Rather than picking il uniformly at ran- 
dom, choose f2 by selecting each entry with probability 
p, independently of the others. With this modification, 
each entry of Vn(Z) is iid with variance pa 2 . Then if 
Z S R nxn , it is known that n" 1 / 2 \\V n (Z)\\ -> v^2p~a, 
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almost surely as n — ► oo. Thus we pick /i = ^/2npa, 
where p = m/n 2 . In practice, this value of n seems 
to work very well for square matrices. For n\ x rii 
matrices, based on the same considerations, the proposal 
is n = {\fn\ + ^Jfi2)^fpa with p = m/(nin,2). 

In order to interpret our numerical results, they are 
compared to those achieved by the oracle, see Section 
III-B To this end, Figure [2]plots three curves for varying 



values of n,p, and r: 1) the RMS error introduced above, 
2) the RMS error achievable when the oracle reveals 
T, and the problem is solved using least squares, 3) 
the estimat ed ora cle root expected MS error derived 
i.e. yj&f/[n 2 p\ = y/df/m, where 
In our experiments, as n and m/df 
= 2, the RMS error of the nuclear norm 

m . 



III-B 



in Section 
df = r(2n 
increased, with r 

problem appeared to be fit very well by 1.68 \/df/ 
Thus, to compare the oracle error to the actual recovered 
error, we plotted the oracle errors times 1.68. We also 
note that in our experiments, the RMS error was never 
greater than 2.25 y/df/m. 

No one can predict the weather. We conclude the nu- 
merical section with a real world example. We retrieved 
from the website [1] a 366 x 1472 matrix whose entries 
are daily average temperatures at 1472 different weather 
stations throughout the world in 2008. Checking its SVD 
reveals that this is an approximately low rank matrix as 
expected. In fact, letting M be the temperature matrix, 
and calling AI 2 the matrix created by truncating the SVD 
after the top two singular values gives ||A/2||f/|| ||f = 
.9927. 

To test the performance of our matrix completion 
algorithm, we subsampl ed 30 % of M and then recovered 
an estimate, M, using (IV.l i. Note that this is a much 



different problem than those proposed earlier in this 
section. Here, we attempt to recover a matrix that is not 
exactly low rank, but only approximately. The solution 
gives a relative error of \\M — M\\f/\\M\\f = .166. 
For comparisoij^] exact knowledge of the best rank-2 
approximation achieves \\M 2 — M\\f/\\M\\f — .121. 
Here /i has been selected to give a good cross-validated 
error and is about 535. 



- recovery error using SDP 

- 1 .68*(oracle error) 
1.68-[(2nr-r 2 )/(pn 2 )]" 2 




20 40 60 




100 200 300 400 500 600 700 800 900 1000 



- recovery error using SDP 

- 1 .68*(oracle error) 

- 1 ,68'[(2nr - r z )/(pn 2 )] 



V. Discussion 

This paper reviewed and developed some new results 
about matrix completion. By and large, low-rank ma- 
trix recovery is a field in complete infancy abounding 
with interesting and open questions, and if the recent 

4 The number 2 is somewhat arbitrary here, although we picked it 
because there is a large drop-off in the size of the singular values after 
the second. If, for example, Mio is the best rank-10 approximation, 
then ||Mi - M|| F /||M|| F = .081. 



Fig. 2: Comparison between the recovery 
error, the oracle error times 1.68, and the 
estimated oracle error times 1.68. Each point 
on the plot corresponds to an average over 20 
trials. Top: in this experiment, n = 600, r — 2 
and p varies. The x-axis is the number of mea- 
surements per degree of freedom (df). Middle: 
n varies whereas r = 2, p = .2. Bottom: 
n = 600, r varies and p — .2. 
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avalanche of results in compressed sensing is any indica- 
tion, it is likely that this field will experience tremendous 
growth in the next few years. 

At an information theoretic level, one would like to 
know whether one can recover low-rank matrices from 
a few general linear functionals, i. e. from A(M) = b, 
where A is a linear map from R™i x ™2 — » R m . In this 
direction, we would like to single out the original result 
of Recht, Fazel and Parrilo [26] who showed — by lever- 
aging the techniques and proofs from the compressed 
sensing literature — that if each measurement is of the 
form (Ak,X), where A^ is an independent array of iid 
Gaussian variables (a la compressed sensing), then the 
nuclear norm heuristics recovers rank-r matrices from on 
the order of nr log n such randomized measurements. 

At a computational level, one would like to have 
available a suite of efficient algorithms for minimiz- 
ing the nuclear norm under convex constraints and, in 
general, for finding low-rank matrices obeying convex 
constraints. Algorithms with impressive performance in 
some situations have already been proposed [7], [24] 
but the computational challenges of solving problems 
with millions if not billions of unknowns obviously still 
require much research. 
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